Abstract

This paper presents an incremental method 
for the tagging of proper names in 
German newspaper texts. The tagging 
is performed by the analysis of the 
syntactic and textual contexts of proper 
names together with a morphological 
analysis. The proper names selected by 
this process supply new contexts which 
can be used for finding new proper names, 
and so on. This procedure was applied 
to a small German corpus (50,000 words) 
and correctly disambiguated 65% of the 
capitalized words, which should improve 
when it is applied to a very large corpus. 

1 Introduction 

The recognition of proper names constitutes one 
of the major problems for the wealth of tagging 
systems developed in the last few years. Most of 
these systems are statistically based and make use of 
statistical properties which are acquired from a large 
manually tagged training corpus. The formation of 
new proper names, especially personal names, is very
productive, and it is not feasible to list them in a
static lexicon. As Church (Church, 1988) already
discussed for English, it is difficult to decide whether
a capitalized word is a proper name if it has a
low frequency (< 20), and so such words were removed
from the lexicon. But because proper names are highly
individual, this is the case for most of them.
Furthermore, the problem of proper name tagging 
for German is not restricted to the disambiguation 
of sentence-initial words, because proper names and 
generic terms (normal nouns) are capitalized both 
at the beginning and within a sentence. Church
suggested labelling words as proper nouns if they
are "adjacent to" other capitalized words. This 
also holds for German proper nouns, but it is 
difficult to decide which of the capitalized words 
belong to the proper name and which not, e.g. is 
it a first name (as in "Helmut Kohl") or is it an 
apposition (as in "Bundeskanzler Kohl"), or is it 
a complex institutional name composed of several 
generic terms and a proper name (as in "Vereinigte 
Staaten von Amerika"). In this procedure, I use 
Church's heuristic for the selection of proper name 
hypotheses, which are evaluated on the basis of 
their syntactic and textual context together with 
a morphological analysis. The starting point of the 
analysis is a small database of definite minimal 
contexts like titles (e.g. "Prof.", "Dr.") and forms 
of address (e.g. "Herr", "Frau"), which increases 
with the processing of texts in which proper names 
are identified, and supplies new contexts which 
can be used to find new proper names and new 
contexts, etc. This incremental method is applied to
unrestricted texts of a small corpus (50,000 words) 
of German newspapers. 

2 Proper Name Acquisition 

From a psycholinguistic point of view it is possible 
that we memorize proper names better if we organize 
them in a hierarchy, in which each word would 
constitute a node whose subordinate nodes are its 
hyponyms (Koss, 1990). For example, we find in
the semantic hierarchy in figure 1 SOCRATES as
hyponym of PHILOSOPHER and PHILOSOPHER 
as hyponym of SCHOLAR, and each node may bear 
features describing properties of the node. 

[Figure 1: SOCRATES in a semantic hierarchy. Node
features shown: SOCRATES (was condemned to death);
PLATO (lived 427-347 BC, student of Socrates, wrote
down the dialogues with Socrates).]

One can observe that hyperonyms of names are
used to identify or to introduce a proper name
in texts. If the knowledge of a name cannot be
presupposed, then the name is often introduced by
an appositional construction (1)-(2) (Hackel, 1986)
and can be used without additional information
(3)-(4) (Kalverkämper, 1978) later on.

(1) der Vorsitzende des Verteidigungsausschusses,
Biehle (CSU), hat Verteidigungsminister Wörner
gebeten, ...

(the chair of the defence committee, Biehle
(CSU), asked the Minister of Defence Wörner
to ...)

(2) der SPD-Abgeordnete Gerster kritisierte, daß ...

(the SPD member of parliament Gerster
criticized that ...)

(3) In einem Fernschreiben an Wörner, äußerte
Biehle am Dienstag, ...

(in a telex to Wörner, Biehle commented on
Tuesday ...)

(4) Gerster forderte eine Mindestflughöhe von 300
Metern

(Gerster called for a minimal flying height of
300 metres)

The syntactic analysis (see section 3) operates
on a small lexicon of definite minimal contexts of
proper names (MC-lexicon) which are used in such
appositional constructions and generates a lexicon
of so-called potential minimal contexts
(MCpot-lexicon).

In addition there exist other methods (Koss, 1987)
for the acquisition of proper names, two of which can
be directly observed in the texts. The first method
("Lernpsychologische Sinnverleihung") tries to lend
sense to the name in order to learn it, e.g. the name
"Düsseldorf" is given the meaning of 'village'. Today
it is a big city, but the compound part -dorf helps us
to identify it as a proper name. The second method,
the formation of name fields ("Namenfelder") and
name scenes ("Namenlandschaften"), helps us to
recognize names describing places which belong to a
certain district or scenery, e.g., cities in the Stuttgart
area like "Tübingen", "Reutlingen", "Esslingen"
have the common suffix -ingen.

The morphological analysis (see section 3)
operates with a list of so-called onomastic suffixes
to identify place names.

3 Proper Name Tagging

An overview of the tagging process is shown in figure
2.



[Figure 2: proper name tagging. Flowchart with four
stages. PREPROCESSING takes corpus(i), the MC-lexicon
and the suffix/prefix list and performs tokenization,
disambiguation of sentence-initial words and tagging
of definite proper names. SYNTACTIC AND MORPHOLOGICAL
ANALYSIS takes corpus(i) and PN-lexicon(j) and yields
PN-lexicon(j) and MCpot-lexicon(k). Hypotheses
processing takes corpus(i), PN-lexicon(j) and
MCpot-lexicon(k) and iterates with j = j+1, k = k+1.
TAGGING takes corpus(i) and PN-lexicon(j) and
produces the tagged corpus.]



Preprocessing 

The corpus has to be preprocessed first of all. This
includes the tokenization of the corpus, in which all
punctuation marks are separated from the words
to allow the following disambiguation of
sentence-initial words. This disambiguation uses a
heuristic derived from the one used in CLAWS
(Garside et al., 1987): if a sentence-initial word also
occurs inside a sentence with a lower case initial
letter, then it is not a noun (normal noun or proper
name) and is represented with lower case letters. For
this I use a list of all words with lower case initial
letter found in the corpus, which is stored in an
AVL-tree (Wirth, 1983) for better searching and
inserting.
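
The sentence-initial heuristic above can be illustrated with a minimal sketch. It is not the original C implementation: a Python set stands in for the AVL-tree, the corpus is assumed to be already tokenized into sentences, and the function names are hypothetical.

```python
def lowercase_vocabulary(sentences):
    """Collect all tokens that appear with a lower-case initial letter."""
    return {tok for sent in sentences for tok in sent if tok[:1].islower()}


def normalize_sentence_initial(sentences):
    """Lower-case a sentence-initial word if it also occurs in lower case."""
    vocab = lowercase_vocabulary(sentences)  # stand-in for the AVL-tree
    normalized = []
    for sent in sentences:
        sent = list(sent)
        if sent:
            lowered = sent[0][:1].lower() + sent[0][1:]
            # Only words seen elsewhere in lower case are treated as non-nouns.
            if lowered != sent[0] and lowered in vocab:
                sent[0] = lowered
        normalized.append(sent)
    return normalized


# Toy corpus (assumption): two tokenized sentences.
corpus = [
    ["Die", "Abgeordnete", "Kelly", "sagte", "..."],
    ["Kelly", "kritisierte", "die", "Pläne", "."],
]
result = normalize_sentence_initial(corpus)
```

In this toy corpus "Die" is lowered because "die" occurs sentence-internally, while "Kelly" stays capitalized and remains a candidate for the later steps.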

After this, a first run through the corpus is
done to identify definite proper names occurring
in the contexts of the MC-lexicon. Apart from
appositions as mentioned above, this lexicon contains
speech-embedding ("redeeinbettende") verbs like
"sagte" and "fragte" frequently used in political
newspaper texts, as in:



(5) die Abgeordnete Kelly sagte, ... 

(the member of parliament Kelly said, ...) 

(6) Heinlein fügte hinzu, ...
(Heinlein added, ...) 

(7) so fragte Apel 
(Apel asked) 

The MC-lexicon also contains prepositions and 
preposition frames to identify place names, as in: 

(8) bei Frankfurt 
(near Frankfurt) 

(9) aus Sollingen bei Baden-Baden 
(from Sollingen near Baden-Baden) 



(10) im Raum Landshut
(in the Landshut area)

All proper names are stored in the PN-lexicon 
which is used during the entire processing. 
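
As a rough illustration of this first run, the following sketch looks for capitalized words next to a few definite minimal contexts. The two small sets are taken from the examples in the text and are only an assumed excerpt of the MC-lexicon; prepositions and preposition frames are left out, and the function name is hypothetical.

```python
# Assumed excerpt of the MC-lexicon: titles/forms of address (left contexts)
# and speech-embedding verbs (left or right contexts).
TITLES = {"Prof.", "Dr.", "Herr", "Frau"}
SPEECH_VERBS = {"sagte", "fragte"}


def definite_proper_names(tokens):
    """Return capitalized tokens that occur in a definite minimal context."""
    found = set()
    for i, token in enumerate(tokens):
        if not token[:1].isupper() or token in TITLES:
            continue
        prev_tok = tokens[i - 1] if i > 0 else ""
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else ""
        if (prev_tok in TITLES
                or prev_tok in SPEECH_VERBS
                or next_tok in SPEECH_VERBS):
            found.add(token)
    return found


pn_lexicon = definite_proper_names(
    "die Abgeordnete Kelly sagte , so fragte Apel".split()
)
```

The resulting set plays the role of the initial PN-lexicon; "Abgeordnete" is not selected here because neither of its neighbours is a definite context.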

Syntactic and Morphological Analysis 

In the following analysis, the immediate syntactic 
and morphological context of all capitalized words 
is examined. If the capitalized word is already 
included in the PN-lexicon, then its immediately 
preceding context is stored as a potential minimal 
context in the MCpot-lexicon if it comprises 
one or more capitalized words. Cases where the 
proper name is marked as genitive are not
considered because this could lead to wrong
MCs (e.g., Aussage Wörners, Besuch Lafontaines).
The collection of potential minimal contexts is
also done in the hypotheses processing, which
follows. For example, the proper name Wörner
supplies the MCs: Bundesverteidigungsminister,
Verteidigungsminister, Minister,
Nato-Generalsekretär.
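
The collection of potential minimal contexts described above might be sketched as follows. This is a simplification under stated assumptions: only the single immediately preceding capitalized word is stored, genitive forms are skipped simply because they do not match a PN-lexicon entry exactly, and the function name is hypothetical.

```python
def collect_mcpot(tokens, pn_lexicon):
    """Store the capitalized word directly before a known proper name."""
    mcpot = set()
    for i in range(1, len(tokens)):
        word, prev_tok = tokens[i], tokens[i - 1]
        if word not in pn_lexicon:
            # Genitive forms such as "Wörners" end up here and are ignored,
            # so "Aussage" is never collected as a context.
            continue
        if prev_tok[:1].isupper() and prev_tok not in pn_lexicon:
            mcpot.add(prev_tok)
    return mcpot


tokens = ["Verteidigungsminister", "Wörner", "und", "die",
          "Aussage", "Wörners"]
contexts = collect_mcpot(tokens, {"Wörner"})
```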

For the recognition of place names, a suffix list 
is used containing onomastic suffixes like -acker, - 
aich, -beuren, -hafen, -hausen, -stetten, -weiler and 
a prefix list containing prefixes like Mittel-, Ost-, 
West-, Zentral-. In addition to this the ending of 
the left capitalized word of two adjacents is checked 
for adjectival endings -er, -aner, as in: 

(11) Mainzer Landtag 

(the state parliament of Mainz) 

(12) Münsteraner Parteitag

(the party conference of Münster)



If they also occur without this ending (Mainz,
Münster), then these forms are proper nouns and
are stored in the PN-lexicon. The adjectival forms
in (11)-(12) are considered as adjectives (following
(Fleischer, 1989), p. 265).

Furthermore, loose appositional constructions
("lockere appositionelle Konstruktionen", (Hackel,
1986)) as in (13)-(14) are analyzed according to
the patterns of noun phrases which occur before the
proper name.

(13) der Staatssekretär des
Landesinnenministeriums, Basten, ...

(the under-secretary of the Department of the
Interior, Basten, ...)

(14) der Chef des Schweizer Wehrministeriums,
Bundesrat Koller, ...

(the director of the Swiss Department of the
Armed Forces, the minister of state Koller, ...)

During this run through the corpus, a second
AVL-tree is constructed in which all capitalized
words are stored together with some information
that can be useful for the hypotheses processing.
For each word (node) there is a counter for all
occurrences of the word with an article and a list
of all its immediately preceding words, if these are
also capitalized or are prepositions (see table 1).

Node        List                              Article
ADN         bei, Nachrichtenagentur           -
Angaben     nach, Donnerstag                  1
Belgien     aus, in                           -
Baum        FDP-Politiker, FDP-Abgeordnete    -

Table 1: contexts of capitalized words

Hypotheses Processing

In this section of the procedure, hypotheses are
generated and evaluated. A hypothesis may consist
of two adjacent capitalized words or a preposition
with a capitalized word. These hypotheses are
evaluated on the basis of all occurrences of the second
word found in the corpus.

A hypothesis of two capitalized words is rejected,
if

1. the left word is already in the PN-lexicon

2. the right word is an inflected form which is not
possible with PNs.



All other hypotheses are analyzed in the following
way. If the left word is a MCpot or a derived form
of a MCpot, then the right word is a proper name.
For example "Senatspräsident Spadolini" is analyzed
as proper name "Spadolini" with the apposition
"Senatspräsident" which is derived from the MCpot
"Präsident". The hypothesis is also accepted if the
right word has a genitive ending and occurs without
this ending in the corpus, because only proper names
may occur in such constructions, as in (15). Normal
nouns have to be accompanied by an article, as in
(16).

(15) die Strategie Frankreichs 
(the strategy of France) 

(16) die Strategie des Mörders
(the strategy of the murderer) 
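
The evaluation of a hypothesis of two capitalized words can be sketched roughly as below. The sketch makes several assumptions that are not in the text: "derived form of a MCpot" is approximated by a compound ending in the MCpot, the genitive ending is reduced to a final -s, the inflection test of rejection rule 2 is left out, and all names are hypothetical.

```python
def is_mcpot_or_derived(word, mcpot):
    """True if word equals a MCpot or is a compound ending in one."""
    lowered = word.lower()
    return any(lowered.endswith(context.lower()) for context in mcpot)


def evaluate_pair(left, right, pn_lexicon, mcpot, corpus_words):
    """Return (decision, proper_name) for two adjacent capitalized words."""
    if left in pn_lexicon:
        return ("rejected", None)  # rejection rule 1
    if is_mcpot_or_derived(left, mcpot):
        return ("accepted", right)
    # Genitive ending on the right word, base form attested in the corpus.
    if right.endswith("s") and right[:-1] in corpus_words:
        return ("accepted", right[:-1])
    return ("unresolved", None)


spadolini = evaluate_pair("Senatspräsident", "Spadolini",
                          set(), {"Präsident"}, set())
frankreich = evaluate_pair("Strategie", "Frankreichs",
                           set(), set(), {"Frankreich"})
dutzend = evaluate_pair("Dutzend", "Personenwagen", set(), set(), set())
```

Pairs such as "Dutzend Personenwagen" stay unresolved in this sketch and are passed on to the distributional evaluation.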

A hypothesis of a preposition and a capitalized 
word is rejected, if the capitalized word 

1. is a potential minimal context 

2. is followed by a genitive article 

3. is followed by a past participle. 

The latter two conditions exclude fixed phrases
("feste Syntagmen"), as in:

(17) aus Anlaß des

(on the occasion of)

(18) in Kauf genommen 
(accepted) 

In addition, it is checked whether we have a 
construction like "zu Olims Zeiten", i.e., whether 
the capitalized word has a genitive ending and is 
followed by a capitalized word. For example, we 
found the following proper names: 

(19) in Lafontaines Worten 

(in the words of Lafontaine) 

(20) in Stoltenbergs Bilanz

(in Stoltenberg's balance sheet)

(21) gegen Hitlers Ermächtigungsgesetz
(against Hitler's Enabling Act)
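
A minimal sketch of the preposition hypotheses follows. The participle test is a crude stand-in (prefix ge- plus ending -t or -en), the set of genitive articles is an assumed excerpt, the genitive ending is reduced to a final -s, and the function names are hypothetical.

```python
GENITIVE_ARTICLES = {"des", "der"}  # assumed excerpt


def looks_like_past_participle(token):
    """Crude stand-in for a real participle check (assumption)."""
    return token.startswith("ge") and token.endswith(("t", "en"))


def evaluate_preposition_pair(word, following, mcpot):
    """Evaluate 'preposition + word', given the token that follows word."""
    if word in mcpot:
        return ("rejected", None)
    if following in GENITIVE_ARTICLES:
        return ("rejected", None)  # e.g. "aus Anlaß des"
    if looks_like_past_participle(following):
        return ("rejected", None)  # e.g. "in Kauf genommen"
    # "zu Olims Zeiten" pattern: genitive ending plus capitalized word.
    if word.endswith("s") and following[:1].isupper():
        return ("accepted", word[:-1])
    return ("candidate", word)


anlass = evaluate_preposition_pair("Anlaß", "des", set())
kauf = evaluate_preposition_pair("Kauf", "genommen", set())
lafontaine = evaluate_preposition_pair("Lafontaines", "Worten", set())
```

Words that are neither rejected nor accepted remain candidates for the distributional evaluation.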

All resulting hypotheses are evaluated by another
procedure which takes into account the AVL-tree
containing all capitalized words together with the
distributional information described above. Because
the corpus is very small and often there is only one
occurrence of a word, this information is not very
reliable and therefore error-prone. This could be
improved by the application of the procedure to a
very large corpus (several million words). At this
point, it is only checked whether the right word
occurs with an article (a clue for a normal noun)
and whether it often occurs with other capitalized
words or prepositions (a clue for a proper name).
Proper names are normally not used with articles,
with the exception of some, mostly place names
and institutional names, which always occur
with an article (e.g. "die Türkei", "die Vereinigten
Staaten"). So, this method has to be used carefully.
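
The distributional check can be illustrated with the sketch below. A dictionary replaces the AVL-tree, the article and preposition sets are assumed excerpts, and the threshold `min_contexts` is purely hypothetical, since the text gives no figure for "often".

```python
from dataclasses import dataclass, field

ARTICLES = {"der", "die", "das", "des", "dem", "den", "ein", "eine"}
PREPOSITIONS = {"aus", "bei", "in", "nach", "ohne"}


@dataclass
class WordStats:
    article_count: int = 0  # occurrences preceded by an article
    left_contexts: list = field(default_factory=list)


def collect_stats(tokens):
    """Build one record per capitalized word (stand-in for the AVL-tree)."""
    stats = {}
    for i, token in enumerate(tokens):
        if not token[:1].isupper():
            continue
        entry = stats.setdefault(token, WordStats())
        if i == 0:
            continue
        prev_tok = tokens[i - 1]
        if prev_tok in ARTICLES:
            entry.article_count += 1
        elif prev_tok[:1].isupper() or prev_tok in PREPOSITIONS:
            entry.left_contexts.append(prev_tok)
    return stats


def distributional_clue(entry, min_contexts=2):
    """An article hints at a normal noun, frequent contexts at a name."""
    if entry.article_count > 0:
        return "normal noun"
    if len(entry.left_contexts) >= min_contexts:
        return "proper name"
    return "undecided"


stats = collect_stats("aus Belgien und in Belgien nach den Angaben".split())
```

As the text warns, the article clue misfires for names such as "die Türkei", so a real implementation would need an exception for them.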

The processing of hypotheses is iterated until no 
more proper names can be found (pn_new = 0), 
because new proper names supply new contexts and 
new contexts may supply new proper names. 
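
The iteration can be pictured as a simple fixed-point loop. In this sketch the two callbacks and the toy corpus of (context, name) pairs are hypothetical stand-ins for the analysis steps; only the stopping condition (no new proper names, i.e. pn_new = 0) is taken from the text.

```python
def iterate_until_stable(pn_lexicon, mcpot_lexicon, find_names, find_contexts):
    """Alternate between collecting contexts and names until nothing is new."""
    names, contexts = set(pn_lexicon), set(mcpot_lexicon)
    while True:
        contexts |= find_contexts(names)
        new_names = find_names(contexts) - names
        if not new_names:  # pn_new = 0
            return names, contexts
        names |= new_names


# Toy corpus (assumption): adjacent pairs of context word and name.
pairs = [("Minister", "Wörner"), ("Minister", "Apel"),
         ("Fraktionskollege", "Apel"), ("Fraktionskollege", "Gerster")]

names, contexts = iterate_until_stable(
    {"Wörner"},
    set(),
    find_names=lambda ctx: {right for left, right in pairs if left in ctx},
    find_contexts=lambda pn: {left for left, right in pairs if right in pn},
)
```

Starting from the single name "Wörner", the loop first learns the context "Minister", then the name "Apel", then the context "Fraktionskollege" and finally the name "Gerster".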

Tagging 

In order to tag the proper names collected in the
PN-lexicon, it is necessary to run through the corpus
one last time. All words listed in the PN-lexicon
are tagged as proper names.

The procedure of proper name tagging was 
implemented in C under UNIX. 

4 Evaluation 

The first half of the corpus was used to develop the 
procedure, the second half served for an evaluation. 
For the evaluation, all proper names in the second 
corpus half were manually tagged and (manually) 
compared to the result of the automatic tagging 
procedure applied to this corpus part, i.e., to a 
corpus of 25,000 words. Of the 1300 proper name
tokens, 461 occurrences were not recognized and 30
text words were wrongly tagged as proper names. This
corresponds to a recognition rate of about 65%
(counting errors not excluded). In order to provide
background for this figure, some of the problems are 
discussed here in more detail. 
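
The recognition rate can be recomputed from the counts given above. The precision figure in this sketch is not reported in the text; it is derived here under the assumption that the 30 wrongly tagged words are the only false positives.

```python
total_proper_names = 1300  # manually tagged proper name tokens
missed = 461               # tokens not recognized
false_positives = 30       # text words wrongly tagged as proper names

recognized = total_proper_names - missed
recall = recognized / total_proper_names

# Assumption: every recognized token was tagged correctly.
precision = recognized / (recognized + false_positives)

print(f"recognized: {recognized}")
print(f"recognition rate: {recall:.1%}")
print(f"precision (derived): {precision:.1%}")
```

The recognition rate of 839/1300 is about 64.5%, which matches the figure of about 65% given in the text.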

The preprocessing module could be improved 
by enlarging the MC-lexicon with a list of most 
frequently used first names, for example. For the 
recognition of non-German proper names, it could 
be possible to add non-German titles and forms 
of address as well. The latter were also found in 
the corpus (e.g. Captain Alan Stephenson, Lord 
Carrington) . 

At the moment, first names are collected in the
MCpot-lexicon if they are used attributively to
a surname already recognized. This is in contrast
to the approaches of (Fleischer), (Wimmer, 197?),
(Kalverkämper, 1978) and others who
analyze first names and surnames as a unit. One
reason for this is that only the surname can be
inflected, as in (22). But as this also applies to titles,
as in (23), the reason does not hold.

(22) Peter Müllers Auto
(the car of Peter Müller)

(23) [example illegible in the source]

A better argument is that constructions of first
name and surname cannot be expanded, e.g., as
loose appositional constructions.

The procedure of proper name tagging described
here is not able to recognize multi-word proper
names because only two adjacent capitalized words
(apposition + proper name) are examined. Table
2 shows an excerpt of unresolved hypotheses in
which some multi-word proper names consisting of
first name and surname (Albrecht Müller, Angelika
Beer, Harry Ristock, Ruth Winkler, Josef Felder,
Gabi Witt, Florian Gerster, Sepp Binder, Kurt
Schumacher), of normal nouns ((das) Deutsche
Rote Kreuz, Kleine Brogel, Ewige Lampe) and of
some non-German proper names (Alan Stephenson,
(Canadian) Air Group, Central Enterprise, Frecce
Tricolori, Standardisation Agreement, Acrobatic Full
Scale) are found.

Text  Hypothesis
1     Militaerflughafen Rhein-Main
2     Dutzend Personenwagen
2     Captain Alan
2     Alan Stephenson
6     Mitte April
7     Metern Abstand
11    Fraktionskollege Albrecht
11    Albrecht Mueller
12    Kanadische Luftwaffendivision
12    Air Group
12    Hochleistungsflugzeug F-18
13    Central Enterprise
13    Central Enterprise
14    Central Enterprise
22    Frecce Tricolori
22    Deutsche Rote
22    Rote Kreuz
22    Dutzend Demonstranten
22    Autobahnzufahrt Frankfurt-Sued
22    Luftsportgruppe Breitscheid/Haiger
23    Kleine Brogel
24    Fraktionskollegin Angelika
24    Angelika Beer
25    Ende September
27    IG Metall
27    Harry Ristock
27    Lehrerin Ruth
27    Ruth Winkler
28    Regierung Kohl
28    Prozent Kandidatinnen
30    Leitende Oberstaatsanwalt
30    Oberstaatsanwalt Sattler
32    Frecce Tricolori
34    Geburtstag Bert
34    Josef Felder
34    Gabi Witt
34    Ewige Lampe
34    Museumsdorf Muehlendorf
34    Florian Gerster
34    Sepp Binder
34    Kurt Schumacher
35    Standardisation Agreement
35    Standardisation Agreement
35    Acrobatic Full
35    Full Scale
36    Frecce Tricolori
36    Frecce Tricolori
36    Frecce Tricolori
36    Demokratische Proletarier
37    IG Metall
37    IG Chemie
37    IG Bergbau
39    Kanzleramt Erwaegungen
56    Partei Ernst
96    Bundespartei Stellung

Table 2: unresolved hypotheses (excerpt)

The non-German proper names are often put in
quotation marks, so this could be an additional
criterion for the hypotheses evaluation, but cases in
which quotation marks are used to emphasize or to
cite one or more words must be excluded (24).

(24) [example illegible in the source]

Multi-word proper names consisting of normal
nouns or mixed of normal nouns, adjectives, articles,
prepositions and proper names constitute a major
problem. Apart from the fact that adjectives
and prepositions belonging to a proper name are
capitalized, some of these proper names (25) behave
like normal nouns, i.e., they are inflectional and take
an article, but some do not (26)-(28). The latter
are mostly used with an introductory apposition
and often put in quotation marks. In the first case it is
difficult to determine which constituents belong
to the proper name and which do not, since the
construction can be modified and reduced as well
(e.g. Vereinigte Staaten von Amerika, die Staaten,
die Bundesrepublik, Deutschland). Under the merely
distributional analysis described here, it is not
possible to recognize them and no easy solution
is possible. In the second case, it is possible to
recognize them if we know the minimal context (here
Luftwaffenbasis, Gasthaus, Straße), which may be
resolved if we use a very large corpus, and if we
consider more than one following word and existing
quotation marks.



(25) die Vereinigten Staaten und die Bundesrepublik 
Deutschland 

(the United States and the Federal Republic of 
Germany) 

(26) auf der nordbelgischen Luftwaffenbasis Kleine 
Brogel 

(at the North Belgian air force base Kleine 
Brogel) 

(27) ein Teil von ihnen geht [...] ins Gasthaus "Ewige
Lampe"

(some of them go to the inn "Ewige Lampe")

(28) ich habe in der Straße "Am Mariahof" gewohnt
(I have lived in the street "Am Mariahof")

Some of the remaining hypotheses in Table 2 
are noun pairs consisting of quantity terms and 
normal nouns (29)-(31) or constructions with month 
names (32). Quantity terms could be excluded by an
exception list and month names could be added to
the PN-lexicon from the start.

(29) ein Dutzend Personenwagen/Demonstranten 
(a dozen automobiles/demonstrators) 

(30) mindestens vierzig Prozent Kandidatinnen 
(at least 40 per cent candidates) 

(31) nach Metern Abstand 

(after a distance of some metres) 

(32) Mitte April/Ende September 

(in the middle of April/at the end of September) 



But some of the remaining hypotheses are the
result of the free German word order, often observed
in sentences with support verb constructions (34:
Ernst machen mit (to be serious about), 35: Stellung
beziehen gegen (to take a stand against)). The
hypothesis 'Kanzleramt Erwägungen' in sentence
(33) could be ruled out if the form 'Erwägungen'
was analyzed as an impossible inflectional form of a
proper noun and therefore as a normal noun. This
was not performed by the morphological analysis¹,
because there were no occurrences of 'Erwägung'
without a plural ending in the corpus. This could
be improved by the use of a very large corpus or a
powerful morphological analyzer (e.g. GERTWOL,
(Koskenniemi and Haapalainen, 1994)). The support
verb constructions could be excluded if we look for
typical verbs used in such constructions (machen,
bringen, nehmen, ...).

¹ The analysis is based on a very simple mechanism:
inflectional endings which are not possible for proper
names are removed from the word under consideration,
and the remaining form is searched for in the corpus. If
successful, the word cannot be a proper name and the
hypothesis is rejected; if not, the hypothesis is kept.

(33) ... war bekanntgeworden, daß im Kanzleramt
Erwägungen [...] stattfanden, wie ...

(... became known that the chancellorship takes 
into consideration ...) 

(34) ... wenn seine Partei Ernst macht mit ... 
(... if his party gets serious about ...) 

(35) ... indem man [...] gegen die Bundespartei 
Stellung bezieht 

(... while taking a stand against the federal 
party) 
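
The simple morphological mechanism described in the footnote can be sketched as follows. The list of endings is an assumption made for illustration (the text does not enumerate them), a set of word forms stands in for the corpus lookup, and the function name is hypothetical.

```python
# Assumed examples of endings that are not possible for proper names.
NON_NAME_ENDINGS = ("en", "ern", "e")


def is_impossible_name_form(word, corpus_words, endings=NON_NAME_ENDINGS):
    """Strip each ending and look the remaining form up in the corpus."""
    for ending in endings:
        if word.endswith(ending) and len(word) > len(ending):
            if word[: -len(ending)] in corpus_words:
                return True  # base form attested: not a proper name
    return False  # hypothesis is kept


with_base_form = is_impossible_name_form("Erwägungen", {"Erwägung"})
without_base_form = is_impossible_name_form("Erwägungen", {"Kanzleramt"})
```

The second call reproduces the failure discussed for sentence (33): without an occurrence of 'Erwägung' in the corpus, the hypothesis cannot be rejected.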

Most of the incorrectly tagged proper names are
the result of the hypotheses processing, because the
corpus is too small. For example, the evaluation
of the hypothesis 'ohne Rücksicht' (with no
consideration) provides 'Rücksicht' as proper name,
because it also occurs with the preposition 'aus'
(from), which is frequently used with place names,
and never occurs with an article, but its frequency
is only 4. This is not representative enough for a
reliable conclusion and it is hoped that a very large
corpus would allow for a better analysis.

5 Conclusions and Future Perspectives

Most of the known statistically based tagging 
systems are confronted with the problem of proper 
name tagging. In German the problem is not 
only restricted to the disambiguation of sentence- 
initial words but also occurs with sentence-internal 
capitalized words. The procedure of proper name 
tagging described here makes use of a database 
of definite minimal contexts as a starting point 
for an analysis which takes into account both 
morphological and syntactic properties of proper 
names. Furthermore, this local analysis is supported 
by a global analysis regarding all occurrences of 
capitalized words in the corpus. This global analysis 
should be improved by a larger corpus than the one
used, and a more meaningful statistical procedure,
like mutual information (Church and Hanks, 1990).
However, the central idea of an incremental
procedure for the collection of proper name contexts
is encouraging. It is planned to include this proper
name tagging in the German part-of-speech tagger
Likely (Feldweg, 1993) developed in Tübingen to
disambiguate all the remaining cases where the
tagger could not decide between proper name and
normal noun.
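
As an illustration of the mutual information statistic mentioned above as a possible improvement, the following sketch computes the pointwise mutual information of two adjacent words from raw counts. It is not part of the procedure described here; the estimation by relative frequencies and the function name are assumptions.

```python
import math
from collections import Counter


def pointwise_mutual_information(tokens, x, y):
    """log2 of P(x, y) / (P(x) * P(y)) for the adjacent pair x y."""
    n = len(tokens)
    if n < 2:
        return float("-inf")
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    joint = bigrams[(x, y)] / (n - 1)
    if joint == 0:
        return float("-inf")  # pair never observed
    return math.log2(joint / ((unigrams[x] / n) * (unigrams[y] / n)))


tokens = ["Frecce", "Tricolori", "und", "Frecce", "Tricolori"]
score = pointwise_mutual_information(tokens, "Frecce", "Tricolori")
```

A high score for a pair of capitalized words would suggest that they form a unit, which might help with multi-word names such as "Frecce Tricolori".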